A hierarchically clustered heatmap of the pairwise correlation coefficients between CRC TFs. The correlation of non-CRC proteins/marks with the CRC TFs is displayed above it.
A similar plot computed on the binarized signal (whether or not a peak is present at a location).
How many of the peaks in A (columns) overlap with peaks in B (rows)?
Overlap of ROW protein peaks with COLUMN protein peaks. A hierarchically clustered heatmap where 1 represents 100% overlap and 0 represents 0% (e.g. E2F3 almost always (>80% of the time) binds under SREBF1, but SREBF1 only does so ~10% of the time). The bottom-left square contains the CRC TFs (clustered). Additional signals from histone marks/states and non-CRC proteins (not clustered) have been added at the top and on the right side. The plot is not symmetric because one protein can bind in many more places than another, and can thus cover all of the other's binding regions while also binding in many other places.
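A minimal sketch of how such an asymmetric overlap matrix could be computed from a binarized peak matrix. The function name `overlap_fraction`, the toy data, and the orientation (entry [i, j] is the fraction of protein i's peaks that also carry a peak of protein j) are assumptions made for illustration, not the code used for the plot.

```python
import numpy as np

def overlap_fraction(binary):
    """binary: (n_regions, n_proteins) 0/1 array; returns (n_proteins, n_proteins).

    Entry [i, j] is the fraction of protein i's peaks that also carry a peak
    of protein j (assumed orientation: i = row, j = column).
    """
    binary = np.asarray(binary, dtype=float)
    shared = binary.T @ binary        # shared[i, j] = regions bound by both i and j
    n_peaks = binary.sum(axis=0)      # number of peaks per protein
    # Divide row i by the peak count of protein i; proteins without peaks give 0.
    return np.divide(shared, n_peaks[:, None],
                     out=np.zeros_like(shared), where=n_peaks[:, None] > 0)

# Toy data: protein 0 binds 2 regions, protein 1 binds those 2 plus 2 others.
toy = np.array([[1, 1],
                [1, 1],
                [0, 1],
                [0, 1],
                [0, 0]])
frac = overlap_fraction(toy)
```

In this toy case `frac[0, 1]` and `frac[1, 0]` differ, which is the same asymmetry the heatmap displays: the protein with fewer peaks is fully covered by the other, but not the reverse.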
Here we display the correlation of the binding signal, computed only on regions where the two proteins/signals overlap. This lets us recover signal that would otherwise be hidden for pairs of TFs that co-bind in only a few places.
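The idea of restricting the correlation to co-bound regions can be sketched as follows. This assumes a signal matrix with 0 wherever no peak was called; the function name `cobound_correlation` and the `min_regions` threshold are hypothetical choices, not taken from the original analysis.

```python
import numpy as np

def cobound_correlation(signal, min_regions=3):
    """signal: (n_regions, n_proteins) array, 0 where no peak was called.

    Returns a matrix of Pearson correlations computed, for each pair, only on
    the regions where both proteins have a peak. Pairs with too few shared
    regions (or constant signal) are left as NaN.
    """
    signal = np.asarray(signal, dtype=float)
    n = signal.shape[1]
    corr = np.full((n, n), np.nan)
    for i in range(n):
        for j in range(n):
            both = (signal[:, i] > 0) & (signal[:, j] > 0)
            if both.sum() < min_regions:
                continue
            x, y = signal[both, i], signal[both, j]
            if x.std() == 0 or y.std() == 0:
                continue  # correlation is undefined for a constant signal
            corr[i, j] = np.corrcoef(x, y)[0, 1]
    return corr

# Toy data: the two proteins co-bind in the first three regions only.
toy = np.array([[1, 2],
                [2, 4],
                [3, 6],
                [5, 0],
                [0, 7],
                [4, 0]], dtype=float)
restricted = cobound_correlation(toy)
genome_wide = np.corrcoef(toy[:, 0], toy[:, 1])[0, 1]
```

On the toy data the restricted correlation between the two proteins is much higher than the correlation over all regions, because the regions bound by only one of them no longer dilute the shared signal.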
Log-enrichments of each TF and chromatin mark over each other. Enrichment is computed with a Fisher's exact test on the expected vs. observed presence of ROW protein peaks under COLUMN protein peaks. P-values are then corrected for multiple hypothesis testing using the Benjamini-Hochberg (BH) method (e.g. here CTCF is very strongly enriched in SMC1, but SMC1 is not as strongly enriched in CTCF).
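A small standard-library sketch of this kind of computation. Several things here are assumptions: the test is taken to be one-sided (enrichment only), the log-enrichment is taken to be log2(observed / expected) co-occurrence, and all function names and counts are illustrative. In practice a library routine such as `scipy.stats.fisher_exact` would normally be used instead of the hand-written tail sum.

```python
from math import comb, log2

def fisher_enrichment_pvalue(both, only_row, only_col, neither):
    """One-sided Fisher's exact test (enrichment) on a 2x2 table of region counts."""
    row_total = both + only_row      # regions with a ROW peak
    col_total = both + only_col      # regions with a COLUMN peak
    n = both + only_row + only_col + neither
    upper = min(row_total, col_total)
    # Hypergeometric upper tail: probability of an overlap >= the observed one.
    tail = sum(comb(row_total, k) * comb(n - row_total, col_total - k)
               for k in range(both, upper + 1))
    return tail / comb(n, col_total)

def log_enrichment(both, only_row, only_col, neither):
    """log2(observed / expected) overlap; assumed definition of log-enrichment."""
    n = both + only_row + only_col + neither
    expected = (both + only_row) * (both + only_col) / n
    return log2(both / expected) if both > 0 else float("-inf")

def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):     # walk from the largest p-value to the smallest
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Toy 2x2 tables: (both, only_row, only_col, neither), made-up counts.
tables = [(3, 0, 0, 3), (0, 3, 3, 0), (2, 1, 1, 2)]
p_values = [fisher_enrichment_pvalue(*t) for t in tables]
adjusted = benjamini_hochberg(p_values)
```

The exact tail sum is fine for small counts like these; for genome-scale peak sets a library implementation is the more practical choice.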
(See above for details.) This plot shows the same enrichments as above, except that the square size decreases for any p-value above $10^{-9}$. Hovering shows the enrichment (value) and p-value (pval) for each protein/mark pair.
I have tried Gaussian mixtures and an agglomerative clustering algorithm. Only the second can create a hierarchical clustering.
It seems that the Gaussian mixture makes more sense, given that the data we have is, for now, more "homogeneous".
I am still not so happy with the clustering. This may be due to how much importance is given to outlier values and to the high number of noisy values from locations with no peaks.
We can use methods similar to those used for RNA-seq to improve this (clamping values, log transform, a first round of PCA, ...).
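The three preprocessing steps mentioned above could look roughly like this. It is only a sketch: the quantile used for clamping, the use of `log1p`, the number of components, and the function name `preprocess` are all assumptions, and the PCA is done with a plain SVD rather than any particular library.

```python
import numpy as np

def preprocess(signal, clamp_quantile=0.99, n_components=2):
    """signal: (n_regions, n_proteins) non-negative array; returns the reduced matrix."""
    signal = np.asarray(signal, dtype=float)
    # 1. Clamp outliers: cap every column at its upper quantile.
    caps = np.quantile(signal, clamp_quantile, axis=0)
    clamped = np.minimum(signal, caps)
    # 2. Log transform (log1p keeps zeros at zero).
    logged = np.log1p(clamped)
    # 3. First round of PCA via SVD on the centred data.
    centred = logged - logged.mean(axis=0)
    u, s, _ = np.linalg.svd(centred, full_matrices=False)
    return u[:, :n_components] * s[:n_components]

# Toy data: Poisson "signal" with one extreme outlier value.
rng = np.random.default_rng(0)
toy = rng.poisson(2.0, size=(200, 6)).astype(float)
toy[0, 0] = 1e4
reduced = preprocess(toy)
```

After clamping, the single extreme value no longer dominates the variance, so the leading components reflect the bulk of the data rather than one outlier.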
Heatmap of the centroid location of each cluster. A centroid represents the center of mass of a cluster in k-means clustering. Here, a cluster with a value of 1 means that its center of mass is completely shifted toward that particular TF. It shows which types of enhancers each cluster captures (archetypal co-bindings). This is very sensitive to the number of clusters and to the scaling applied.
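To make the notion of a centroid concrete, here is a minimal k-means sketch in NumPy on signal scaled to [0, 1]. It uses a deterministic farthest-point initialization to keep the example reproducible, which is a simplification and not necessarily what was used for the plot; the function name and toy data are illustrative.

```python
import numpy as np

def kmeans(data, k, n_iter=50):
    """Lloyd's algorithm; returns (centroids, labels)."""
    data = np.asarray(data, dtype=float)

    def assign(centroids):
        # Index of the nearest centroid (squared Euclidean distance) for every row.
        dists = ((data[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        return dists.argmin(axis=1)

    # Farthest-point initialization (assumed here for determinism).
    chosen = [data[0]]
    for _ in range(1, k):
        d = np.min([((data - c) ** 2).sum(axis=1) for c in chosen], axis=0)
        chosen.append(data[d.argmax()])
    centroids = np.array(chosen)

    for _ in range(n_iter):
        labels = assign(centroids)
        for c in range(k):
            members = data[labels == c]
            if len(members):                 # keep the old centroid if a cluster is empty
                centroids[c] = members.mean(axis=0)
    return centroids, assign(centroids)

# Toy data: 5 peaks bound only by TF 0, then 5 peaks bound only by TF 1.
toy = np.array([[1.0, 0.0]] * 5 + [[0.0, 1.0]] * 5)
centroids, labels = kmeans(toy, k=2)
```

Each row of `centroids` is what one row of the heatmap would show: a value close to 1 in a column means the cluster's center of mass sits on peaks dominated by that TF.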
The cobinding matrix, ordered by the k-means clusters (see the cobinding matrix description).
Log-enrichment of each TF/chromatin mark under each cluster (see the previous enrichment plot for a description). Enrichments are filtered for an adjusted p-value below $10^{-3}$.
The p-value will be indicated by opacity.
Clustered cobinding matrix where the signal for non-CRC peaks has been replaced by their enrichment over the clusters, and where the signal for CRC peaks has been averaged over the clusters.
Kohonen Self-Organizing Maps. Each node learns to recognize a pattern in the dataset, regardless of its distribution. Each node tries to be as similar as possible to its close neighbors and dissimilar to nodes further away. This reduces the dimensionality of the dataset, regardless of distances, which makes a lot of sense for our data type. We end up with a map of 400 datapoints which represent the binding code in the dataset.
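A compact sketch of how a self-organizing map is trained, written in plain NumPy. The 20 x 20 default grid is an assumption (one way to obtain 400 nodes), as are the learning-rate and neighborhood schedules, the function names, and the toy data; the toy run uses a smaller 5 x 5 grid to stay fast.

```python
import numpy as np

def train_som(data, rows=20, cols=20, n_steps=2000, lr0=0.5, seed=0):
    """Train a rows x cols SOM; returns weights of shape (rows, cols, n_features)."""
    data = np.asarray(data, dtype=float)
    rng = np.random.default_rng(seed)
    weights = rng.random((rows, cols, data.shape[1]))
    # Grid coordinates of every node, used to measure neighborhood on the 2D map.
    grid = np.stack(np.meshgrid(np.arange(rows), np.arange(cols), indexing="ij"), axis=-1)
    sigma0 = max(rows, cols) / 2
    for step in range(n_steps):
        frac = step / n_steps
        lr = lr0 * (1 - frac)                   # learning rate decays linearly
        sigma = max(sigma0 * (1 - frac), 0.5)   # neighborhood radius shrinks, floored at 0.5
        x = data[rng.integers(len(data))]       # one random sample per step
        # Best-matching unit (BMU): node whose weights are closest to the sample.
        dist = ((weights - x) ** 2).sum(axis=2)
        bmu = np.unravel_index(dist.argmin(), dist.shape)
        # Pull the BMU and its grid neighbors toward the sample (Gaussian neighborhood).
        grid_dist2 = ((grid - np.array(bmu)) ** 2).sum(axis=2)
        influence = np.exp(-grid_dist2 / (2 * sigma ** 2))
        weights += lr * influence[:, :, None] * (x - weights)
    return weights

def best_matching_unit(weights, x):
    """Grid position (row, col) of the node closest to x."""
    dist = ((weights - np.asarray(x, dtype=float)) ** 2).sum(axis=2)
    return tuple(int(i) for i in np.unravel_index(dist.argmin(), dist.shape))

# Toy data: two noisy binding patterns over three TFs.
rng = np.random.default_rng(1)
pattern_a, pattern_b = np.array([1.0, 0.0, 0.0]), np.array([0.0, 0.0, 1.0])
toy = np.vstack([pattern_a + 0.05 * rng.standard_normal((50, 3)),
                 pattern_b + 0.05 * rng.standard_normal((50, 3))])
weights = train_som(toy, rows=5, cols=5, n_steps=500)
```

Calling `best_matching_unit(weights, pattern_a)` and `best_matching_unit(weights, pattern_b)` gives the map positions of the two patterns; after training, different binding patterns are expected to land on different nodes.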
Displays a map of each node, respecting the 2D neighborhood, and its enrichment for a specific protein/mark.
Displays the similarity in what the nodes have learned: how different the sum of their weights is compared to that of their neighbors.
Displays the same differential map as above, but lets the user look at the key TFs/marks that are enriched in each node (there might be more than one). This is very sensitive to the filter used to define what is enriched and what is not.
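One common way to draw this kind of neighbor-difference map is a U-matrix, sketched below. Note that this swaps in the standard definition (mean Euclidean distance between a node's weight vector and those of its 4-connected grid neighbors), which may differ from the sum-of-weights comparison used for the actual plot; the function name and toy weights are illustrative, and the grid is assumed to contain at least two nodes.

```python
import numpy as np

def u_matrix(weights):
    """weights: (rows, cols, n_features) SOM weights; returns a (rows, cols) map."""
    rows, cols, _ = weights.shape
    umat = np.zeros((rows, cols))
    for r in range(rows):
        for c in range(cols):
            dists = []
            # Up, down, left and right neighbors that fall inside the grid.
            for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                rr, cc = r + dr, c + dc
                if 0 <= rr < rows and 0 <= cc < cols:
                    dists.append(np.linalg.norm(weights[r, c] - weights[rr, cc]))
            umat[r, c] = np.mean(dists)
    return umat

# Toy weights: a 2 x 3 map whose right-most column learned a different pattern.
toy_weights = np.zeros((2, 3, 2))
toy_weights[:, 2, :] = 1.0
umat = u_matrix(toy_weights)
```

High values mark borders between groups of nodes that learned different patterns; in the toy map they appear only around the boundary between the second and third columns.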
Which node is most enriched in which specific signal? The color represents the number of points mostly recognized by that node. The more points, the stronger the enrichment.
Here we show which nodes recognize signal from super-enhancers.
A density plot where each point represents a two-dimensional hexagonal bin and the intensity represents the accumulation of datapoints in that bin. The datapoints are a random sample of ~10% of all peaks of the cobinding matrix. They were dimensionality-reduced using t-SNE on the scaled signal of bound TFs at these peaks.
These islands and clusters represent categories / types of enhancers defined by their binding code.